{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Benchmarks for Token-Leen String Operations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will be conducting benchmarks on a real-world dataset of English words. Feel free to replace with your favorite dataset :)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File ‘../leipzig1M.txt’ already there; not retrieving.\n"
     ]
    }
   ],
   "source": [
    "!wget --no-clobber -O ../leipzig1M.txt https://introcs.cs.princeton.edu/python/42sort/leipzig1m.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "171,285,845 words\n"
     ]
    }
   ],
   "source": [
    "text = open(\"../xlsum.csv\", \"r\").read(1024 * 1024 * 1024)\n",
    "words = text.split()\n",
    "words = tuple(words)\n",
    "print(f\"{len(words):,} words\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('to', 2),\n",
       " ('Châu,', 6),\n",
       " ('возможность', 22),\n",
       " ('la', 2),\n",
       " (\"doesn't\", 7),\n",
       " ('सकता', 12),\n",
       " ('and', 3),\n",
       " ('Interestingly,', 14),\n",
       " ('have', 4),\n",
       " ('Зрители', 14)]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import random\n",
    "\n",
    "word_examples = random.choices(words, k=10)\n",
    "word_lengths = [len(s.encode('utf-8')) for s in word_examples]\n",
    "\n",
    "list(zip(word_examples, word_lengths))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hashing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Throughput"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's chack how long it takes the default Python implementation to hash the entire dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.09 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit -n 1 -r 1\n",
    "text.__hash__()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9.75 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit -n 1 -r 1\n",
    "for word in words: word.__hash__()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's compare to StringZilla's implementation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import stringzilla as sz\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7.69 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit -n 1 -r 1\n",
    "sz.hash(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "13.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit -n 1 -r 1\n",
    "for word in words: sz.hash(word)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Quality and Collisions Frequency"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the most important qualities of the hash function is it's resistence to collisions. Let's check how many collisions we have in the dataset.\n",
    "For that, we will create a bitset using NumPy with more than `len(word)` bits for each word in the dataset. Then, we will hash each word and set the corresponding bit in the bitset. Finally, we will count the number of set bits in the bitset. The more empty spots are left in the bitset, the weaker is the function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def count_populated(words, hasher) -> int:\n",
    "    slots_count = len(words) * 2\n",
    "    bitset = np.zeros(slots_count, dtype=bool)\n",
    "\n",
    "    # Hash each word and set the corresponding bit in the bitset\n",
    "    for word in words:\n",
    "        hash_value = hasher(word) % slots_count\n",
    "        bitset[hash_value] = True\n",
    "\n",
    "    # Count the number of set bits\n",
    "    return np.sum(bitset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7,982,184 unique words\n"
     ]
    }
   ],
   "source": [
    "unique_words = set(words)\n",
    "print(f\"{len(unique_words):,} unique words\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collisions for `hash`: 92,147 ~ 1.1544%\n"
     ]
    }
   ],
   "source": [
    "populated_default = count_populated(words, hash)\n",
    "collisions_default = len(unique_words) - populated_default\n",
    "print(f\"Collisions for `hash`: {collisions_default:,} ~ {collisions_default / len(unique_words):.4%}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collisions for `sz.hash`: 97,183 ~ 1.2175%\n"
     ]
    }
   ],
   "source": [
    "populated_stringzilla = count_populated(words, sz.hash)\n",
    "collisions_stringzilla = len(unique_words) - populated_stringzilla\n",
    "print(f\"Collisions for `sz.hash`: {collisions_stringzilla:,} ~ {collisions_stringzilla / len(unique_words):.4%}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Base10 Numbers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Benchmarks on small datasets may not be very representative. Let's generate 4 Billion unique strings of different length and check the quality of the hash function on them. To make that efficient, let's define a generator expression that will generate the strings on the fly. Each string is a printed integer representation from 0 to 4 Billion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def count_populated_synthetic(make_generator, n, hasher) -> int:\n",
    "    slots_count = n * 2\n",
    "    bitset = np.zeros(slots_count, dtype=bool)\n",
    "\n",
    "    # Hash each word and set the corresponding bit in the bitset\n",
    "    for word in make_generator(n):\n",
    "        hash_value = hasher(word) % (slots_count)\n",
    "        bitset[hash_value] = True\n",
    "\n",
    "    # Count the number of set bits\n",
    "    return np.sum(bitset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "n = 256 * 1024 * 1024"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_printed_numbers_until(n):\n",
    "    \"\"\"Generator expression to yield strings of printed integers from 0 to n.\"\"\"\n",
    "    for i in range(n):\n",
    "        yield str(i)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collisions for `hash`: 57,197,029 ~ 21.3076%\n"
     ]
    }
   ],
   "source": [
    "populated_default = count_populated_synthetic(generate_printed_numbers_until, n, hash)\n",
    "collisions_default = n - populated_default\n",
    "print(f\"Collisions for `hash`: {collisions_default:,} ~ {collisions_default / n:.4%}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collisions for `sz.hash`: 57,820,385 ~ 21.5398%\n"
     ]
    }
   ],
   "source": [
    "populated_sz = count_populated_synthetic(generate_printed_numbers_until, n, sz.hash)\n",
    "collisions_sz = n - populated_sz\n",
    "print(f\"Collisions for `sz.hash`: {collisions_sz:,} ~ {collisions_sz / n:.4%}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Base64 Numbers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "import base64\n",
    "\n",
    "def int_to_base64(n):\n",
    "    byte_length = (n.bit_length() + 7) // 8\n",
    "    byte_array = n.to_bytes(byte_length, 'big')\n",
    "    base64_string = base64.b64encode(byte_array)\n",
    "    return base64_string.decode() \n",
    "\n",
    "def generate_base64_numbers_until(n):\n",
    "    \"\"\"Generator expression to yield strings of printed integers from 0 to n.\"\"\"\n",
    "    for i in range(n):\n",
    "        yield int_to_base64(i)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collisions for `hash`: 57,197,478 ~ 21.3077%\n"
     ]
    }
   ],
   "source": [
    "populated_default = count_populated_synthetic(generate_base64_numbers_until, n, hash)\n",
    "collisions_default = n - populated_default\n",
    "print(f\"Collisions for `hash`: {collisions_default:,} ~ {collisions_default / n:.4%}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collisions for `sz.hash`: 224,998,905 ~ 83.8186%\n"
     ]
    }
   ],
   "source": [
    "populated_sz = count_populated_synthetic(generate_base64_numbers_until, n, sz.hash)\n",
    "collisions_sz = n - populated_sz\n",
    "print(f\"Collisions for `sz.hash`: {collisions_sz:,} ~ {collisions_sz / n:.4%}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Base256 Representations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_base256_numbers_until(n) -> bytes:\n",
    "    \"\"\"Generator a 4-byte long binary string wilth all possible values of `uint32_t` until value `n`.\"\"\"\n",
    "    for i in range(n):\n",
    "        yield i.to_bytes(4, byteorder='big')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collisions for `hash`: 57,195,602 ~ 21.3070%\n"
     ]
    }
   ],
   "source": [
    "populated_default = count_populated_synthetic(generate_base256_numbers_until, n, hash)\n",
    "collisions_default = n - populated_default\n",
    "print(f\"Collisions for `hash`: {collisions_default:,} ~ {collisions_default / n:.4%}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collisions for `sz.hash`: 59,905,848 ~ 22.3167%\n"
     ]
    }
   ],
   "source": [
    "populated_sz = count_populated_synthetic(generate_base256_numbers_until, n, sz.hash)\n",
    "collisions_sz = n - populated_sz\n",
    "print(f\"Collisions for `sz.hash`: {collisions_sz:,} ~ {collisions_sz / n:.4%}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Protein Sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Benchmarks on small datasets may not be very representative. Let's generate 4 Billion unique strings of different length and check the quality of the hash function on them. To make that efficient, let's define a generator expression that will generate the strings on the fly. Each string is a printed integer representation from 0 to 4 Billion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "n = 1 * 1024 * 1024"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_proteins(n):\n",
    "    \"\"\"Generator expression to yield strings of printed integers from 0 to n.\"\"\"\n",
    "    alphabet = 'ACGT'\n",
    "    for _ in range(n):\n",
    "        yield ''.join(random.choices(alphabet, k=300))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collisions for `hash`: 223,848 ~ 21.3478%\n"
     ]
    }
   ],
   "source": [
    "populated_default = count_populated_synthetic(generate_proteins, n, hash)\n",
    "collisions_default = n - populated_default\n",
    "print(f\"Collisions for `hash`: {collisions_default:,} ~ {collisions_default / n:.4%}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collisions for `sz.hash`: 223,995 ~ 21.3618%\n"
     ]
    },
    {
     "ename": "",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31mThe Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click <a href='https://aka.ms/vscodeJupyterKernelCrash'>here</a> for more info. View Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
     ]
    }
   ],
   "source": [
    "populated_sz = count_populated_synthetic(generate_proteins, n, sz.hash)\n",
    "collisions_sz = n - populated_sz\n",
    "print(f\"Collisions for `sz.hash`: {collisions_sz:,} ~ {collisions_sz / n:.4%}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
